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Abstract 

We present a rnew iterative algorithnn for the nnolecular distarnce geonnetry problem with inaccurate and sparse 
data, which is based on the solution of linear systems, maximum cliques, and a minimization of nonlinear least- 
squares function. Computational results with real protein structures are presented in order to validate our 
approach. 



Background 

The knowledge of the protein structure is very important 
to understand its function and to analyze possible interac- 
tions with other proteins. Different methods can be applied 
to acquire protein structural information. Until 1984, the 
X-ray crystallography was the ultimate tool for obtaining 
information about protein structures, but the introduction 
of nuclear magnetic resonance (NMR) as a technique to 
obtain protein structures made it possible to obtain data 
with high precision in an aqueous environment much clo- 
ser to the natural surroundings of living organism than the 
crystals used in crystallography [1]. 

The NMR technique provides a set of inter-atomic dis- 
tances for certain pairs of atoms of a given protein. The 
molecular distance geometry problem (MDGP) arises in 
NMR analysis context. The MDGP consists of finding 
one set of atomic coordinates such that a given list of 
geometric constraints are satisfied [2]. Formally, the 
molecular distance geometry problem can be defined as 
the problem of finding Cartesian coordinates 
Xi, . . . G of atoms of a molecule such that lij < \ \ 
Xi ' XjW < Uip V(/, f) G £, where the bounds 4> and Uij for 
the Euclidean distances of pairs of atoms /) g E are 
given a priori [3]. 
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As suggested by Crippen and Havel [3], the MDGP can 
also be formulated as the global optimization problem of 
minimizing the function 

/W = Pij[Xi-Xj), 

where the pairwise function pij : ^ M is defined 
by 

Clearly, x = {xi, . . . G M^" solves the MDGP if, 
and only if, x is 3. global minimizer offdindjix) = 0. 

An overview on methods applied to the MDGP is 
given in [4] and a very recent survey on distance geome- 
try is given in [5]. 

Particular cases of the MDGP can be solved in a rela- 
tively easy way. For instance, when we know all dis- 
tances dij = \\Xi ' Xj\\, i.e., dij = lij = Uij and £ = {1, 2, 
n}^, a solution can be obtained by factoring the distance 
matrix D = [dij] . Assuming that D = [dij] has the singu- 
lar value decomposition irLil^ = D, then x = UY}'^ is a 
solution for the exact MDGP defined by lij = Uij = d^j 
[3]. Even in the case where the set of known distances is 
incomplete, i.e., when some entries of the distance 
matrix D = [dij] is unknown, we can solve the MDGP in 
linear time using an iterative algorithm called geometric 
buildup [6]. First, this algorithm initializes a set B 
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(base) with the index of four points, whose distances 
between all of them are known. Then, the coordinates 
of the points in B are set using the singular value 
decomposition of the incomplete distance matrix D 
restricted to the base B > and the remaining unset coor- 
dinates Xj are calculated by solving the linear system 



{Xi, Xj) 



(1) 



where i e {ii,i2,hfU} C B and dij = \ \xj - Xi\\, The 
indexes /i, (2, i's, 14 can be chosen in an arbitrarily way, 
allowing us to choose another base subset when calcu- 
lating the coordinate of the next Xj. At each iteration, 
the index ; of the new coordinate Xj is inserted in the 
set B increasing the number of subsets {/i, is, i^} 
used as anchors to fix the remaining unset coordinates. 

Unfortunately, in practice, the NMR experiments just 
provide a subset of distances between atoms spatially 
close and the data accuracy is limited. Thus in the real 
scenario, the set E is sparse and lij < Uij. So, we just 
have bounds to some of the entries of the distance 
matrix D. In this situation, neither the singular value 
decomposition nor the buildup algorithm can be 
applied directly because they are both designed to deal 
with exact distances. In fact, the inaccurate and sparse 
instances of MDGP, where lij <Uip are much harder to 
solve as pointed by More and Wu who showed that the 
MDGP with inaccurate distances belongs to the NP- 
hard class of problems [7]. 

Our contribution is a new algorithm that can handle 
with inaccurate and sparse distance data. We propose 
an iterative method based on simple ideas: generate an 
approximated distance matrix D, take as base a clique in 
the graph that has D as a connectivity matrix, solve the 
system (1) and refine the solution using a nonlinear 
least-squares method. It needs to be pointed that the 
authors of the buildup algorithm and coworkers have 
done some modifications in the original form of the 
algorithm in order to handle inaccurate data [8,9]. How- 
ever, the main advantage of our proposal is its simplicity 
and robustness. We have been able to find solutions 
with acceptable quality to instances of MDGP with inac- 
curate and sparse data, considering up to thousands of 
atoms. 

The new iterative method 

Defining the initial base 

The set E of pairs (/, 7) and the set of indexes V = {1, 2, 
n} can be considered as a set of edges and a set of 
vertexes of a graph G = (V, £), respectively. One may 
decide to use as base the biggest complete subgraph of 
G. The problem of calculating the biggest complete sub- 
graph belongs to the NP-complete class and it has a 



large number of applications (for a review in this subject 
consult [10]). We decided to use the algorithm cliquer 
proposed by Ostergard in [11,12] mainly because its good 
behavior in graphs of moderately size and its availability 
on the Internet [13,14]. The cliquer algorithm uses a 
branch-and-bound algorithm developed by Ostergard 
[15], which is based on an algorithm proposed by Carra- 
ghan and Pardalos [16]. 

Setting the coordinates 

Once we have obtained the base B associated with a 
complete subgraph using the algorithm cliquer, we 
need to set its coordinates. In order to generate an 
approximated Euclidean distance matrix (EDM) 
restricted to the points in the base, we define a matrix 
D{t) = [dij{t)l where 



dij{t) = (1 - tij)lij + tijUij 



(2) 



for tij G [0, 1] for each (/, y) g £. With this choice, we 
have lij < dij < Uip but D may not be an EDM with 
appropriated embedding dimension (/: = 3). This may 
happen because the entries d^ can violate the triangular 
inequality d^ < dij^ + djj^ for some indexes /, k, or 
because the rank of D is greater than 3. With this in 
mind, instead of considering the solution given by sin- 
gular value decomposition directly, we take the columns 
(eigenvectors) of U associated with the 3 largest eigenva- 
lues, getting the best 3-approximation rank of the solu- 
tion to xx^ = D{t) [17]. 

Refinement process 

We should not expect great precision in Xy because the 
matrix D{t) is just an approximation. Then, we try to 
refine it by minimizing the nonlinear function 



mm 0x,r {x) = ^ {x, I, u), 

{i,j)eE:i,jeB 

where 

/, u) = X[lij - Uij) + O'l^lx, I) + O'l^lx, u), 

and 



(3) 



with X >0, r >0. The parameter r controls the smooth- 
ness degree and X controls the intensity (weight) of the 
penalty function (see Figure 1). 

The function (p^.^^^ is infinitely differentiable with 
respect to x^ and therefore allows the application of clas- 
sical optimization methods. The function is a varia- 
tion of the hyperbolic penalty technique used in [18,19]. 
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T = 0.1, A = 1.0 

T = 0.1, A = 2.0 

T = 0.3, A = 1.0 




Figure 1 The hyperbolic smooth penalty function. 

parameter z controls the smoothness and the parameter A is 
related to the intensity of the penalty. 



The 



In order to minimize the function (t)^^x> we used the local 
minimization routine va35 encoded in FORTRAN and 
available at Harwell Subroutine Library. The routine 
va35 implements the method BFGS with limited mem- 
ory [20] (For additional information on this routine, see 
[21]). 

Once we have refined the coordinates of the points in 
the base B > we start to set the remaining (free) points. 
We begin with the points that have at least four con- 
straints with the points in the base. In order to set the 
coordinate Xp instead of using just four constraints invol- 
ving the index ; (like in the original version of the buildup 
algorithm), we use all constraints involving the index ; and 
the indexes in the base. Explicitly, to set the coordinate Xp 
we use the approximated distance matrix D{t) for some t 
G [0, 1]'^', solve the linear system 



- £-. + £- 
{xi,Xj) = — ^ i G B, 



(4) 



and then we refine the solution by minimizing the 
function (px^zM restricted to the index ; and to the 
indexes in the base (see eq. (3)). Each newly calculated 
coordinate is included in the base. In the end, some 
points may not be fixed because they have less than 
four constraints involving the points in the base. In this 
case, we just position these points solving an undeter- 
mined system defined by constraints with points in the 
base. Our presented ideas are compiled in the algorithm 
Isbuild (see Additional file 1). 

Methods 

We have implemented our algorithm Isbuild in Matlab 
and tested it with a set of model problems on an Intel 



Core 2 Quad CPU Q9550 2.83 GHz, 4GB of RAM and 
Linux OS-32 bits. In all experiments the parameters of 
the function 0;t,r of the algorithm Isbuild were set at 
A = 1.0 and at r = 0.01. 

We compared our results with the algorithms dgsol 
and buildup. The algorithm dgsol proposed by More 
and Wu in [22] uses a continuation approach based on 
the Gaussian transformation 



of the nonsmooth function 



lly-x|| 

A2 



dy 



fix) 



J2 



-Xj), 



where the potentials pij are given by 



pijix) = max 



The algorithm dgsol starts with an approximated 
solution and, given a sequence of smoothing parameters 
Xo > Xi > ... > Xp = 0, it determines a mmimizer X/^+i of 
(f)x^ The algorithm dgsol uses the previous minimizer X/^ 
as the starting point for the search. In this manner a 
sequence of minimizers Xi, Xp+i is generated, with the 
Xp+i a minimizer of /and the candidate for the global 
minimizer. In our experiments, we used the implemen- 
tation of the algorithm dgsol encoded in language C 
and downloaded from [23]. 

We also compared our results with the ones obtained 
by the version of the algorithm buildup proposed by 
Sit, Wu and Yuan in [8]. The algorithm buildup starts 
defining a base set using four points whose distances 
between all of them are known (a clique of four points). 
Then, at each iteration, a new point X/^ with known dis- 
tances to at least four points in the base is selected. In 
order to avoid the accumulation of errors, instead of 
just positioning the new point, in the modified version 
of the algorithm buildup the entire substructure formed 
by the point x^ and its neighbors in the base is calcu- 
lated by solving the nonlinear system 

. df^- df. + df^ 

{XuXj)=^ ^ ^,Vf,jGB 

with variables Xi = [xj , xf , x^), Xj = [xj,xj,x^) g 
and B being the set formed by the index k and the 
indexes of all neighbors of X/^ in the current base set. 
The parameters d/^j are the given distances between the 
node Xk and its neighbors Xj in the base and, for the 
nodes Xj and Xi already in the base, if the distance 
between them is unknown, we consider dif = \ \Xi - xM, 
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Once the substructure is obtained, it is inserted in the 
original structure by an appropriated rotation and trans- 
lation and the point is included in the base. This pro- 
cess is repeated until all nodes are included in the base. 
We have implemented the buildup algorithm in Matlab. 

Our decision to compare the Isbuild with the algo- 
rithms dgsol and buildup is mainly motivated by theirs 
similarities with our proposal. In fact, the algorithm 
dgsol uses a smooth technique in order to avoid the 
local minimizers and the algorithm buildup solves a 
sequence of systems which produce partial solutions 
and iteratively try to construct a candidate to global 
solution. Our algorithm combines some variations of 
these two ideas. We use a hyperbolic smooth technique 
to insert differentiability in the problem and a divide- 
and-conquer approach based in sucessive solutions of 
overdetermined linear systems in order to construct a 
candidate to global solution. 

In our experiments, the distance data were derived 
from the real structural data from the Protein Data 
Bank (PDB) [24]. It needs to be pointed that each of the 
algorithms considered has a level of randomness, the 
algorithm dgsol takes random start point and the algo- 
rithms Isbuild and buildup starts with an incomplete 
random matrix D = [dij] where lij < dij < Uij. So, in 
order to do a fair comparison, we run each test 30 
times. 

We considered two set of instances. The first one was 
proposed by More and Wu in order to validate the algo- 
rithm dgsol [22]. This set is derived from the three- 
dimensional structure of the fragments made up of the 
first 100 an 200 atoms of the chain A of protein 
PDB:1GPV[25,26]. For each fragment, we generated a 
set of constraints considering only atoms in the same 
residue or the neighboring residues. Formally, 

E = {[hi) : Xi e R[k),Xj e [R[k) U R[k + 1)), Vfe}, 

where R{k) represents the /c-th residue. 
In this set of instances, the bounds lij and Uij were 
given by the equations 

kj = (1 - ^)dij, Uij = (1 + s)dij, 

where dij is the real distance between the nodes Xi 
and Xj in the known structure x''' of protein PDB:1GPV. 
In this way, all distances between atoms in the same 
residue or neighboring residues were considered. We 
generated two instances for each fragment by taking s 
equals to 0.00 and 0.08. 

In order to measure the precision of the solutions just 
with respect to the constraints, without providing any 
information about the original structure x"^, we use the 
function 



where 

eij = max{/y - \ \Xi - Xj\\, \ \Xi - Xj 1 1 - Uij, 0} 

is the error associated to the constraint /,•,• < 1 1 X^ — Xj 1 1 
< Uij: We also measured the deviation 

of the solutions generated by each algorithm with 
respect to the original solution x^ in the PDB files, using 
the fiinction 

RMSD = ^ min I |x* - Q(x - h) \ \f, (6) 

with h eW"^ and Q g M^""^ orthogonal. 

In the second experiment, we use a more realistic set of 
instances with larger proteins proposed by Biswas in [17]. 
Typically, just distances below 6A (lA = 10'^ cm) between 
some pair of atoms can be measured by NMR techniques. 
So, in order to produce more realistic data, we considered 
only 70% of the distances lower than = 6 A. To introduce 
noise in the model, we set the bounds using the equations 

lij = d*- max(0, 1 - \6ij\), Uij = d*(l + |), 

where d'^j is the true distance between atom / and 
atom / and ^ij'^.. (normal distribution). 

With this model, we generate a sparse set of constraints 
and introduce a noise in the distances that are not so 
simple as the one used in the instances proposed by 
More and Wu. 

Results and discussion 

In Table 1 we can see the results of the first experiment 
defined from the protein PDB:1GPV and all distances in 
the same or neighboring residues. The values show that 
the algorithms buildup and Isbuild worked better 
(lower LDME and RMSD and CPU time) than the algo- 
rithm dgsol in all instances. The algorithms buildup 
performed slightly better than the algorithm Isbuild 
being the fastest algorithm. Despite its simplicity, this 
set of instances worked as an indication of the correct- 
ness of our implementation of the buildup algorithm. 

Table 2 shows the results of the second experiment 
with more realistic data. We can see that our approach 
was more efficient than the algorithms buildup and 
dgsol that were not able to find good solutions in these 
harder instances. In this table, \ V\ is the number of 
atoms in the instance, and CPU time is given in sec- 
onds. We also point out that LDME was low and the 
RMSD was lower than 3.5A in all instances, which 
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Table 1 RMSD, LOME and the CPU time in seconds for PDB:1GPV protein. 



(LOME) 



Fragment with 100 atoms 



0.00 



(RMSD) 



{TIME) 



(LOME) 



0.08 



(RMSD) 



{TIME) 



dgsol 
buildup 
Isbuild 



8.29E-03 
3.50E-15 
6.47E-15 



3.93E-01 
1.46E-14 
1.20E-14 



3.61 E+00 
1.08E-01 
1.51 E-01 



3.31 E-03 
O.OOE+00 
O.OOE+00 



8.25E-01 
3.13E-01 
7.77E-02 



4.40E+00 
1.08E-01 
1.33E-01 



{LOME) 



Fragment with 200 atoms 



: 0.00 



{RMSD) 



{TIME) 



{LOME) 



{RMSD) 



{TIME) 



dgsol 
buildup 
Isbuild 



3.18E-02 
4.85 E- 15 
1.90E-14 



2.58E+00 
2.45E-14 
5.21 E-14 



1.48E+01 
3.11 E-01 
6.01 E-01 



4.00E-03 
O.OOE+00 
O.OOE+00 



2.45 E+00 
5.18E-01 
4.21 E-01 



1.73E+01 
3.11 E-01 
5.25E-01 



Results for the fragments made up with the first 100 and 200 atoms of protein PDB:1GPV. The {LOME) and {RMSD) represent the LOME and RMSD measures 
respectively and <7"/ME> represents the mean time in seconds. 



Table 2 RMSD and LOME for the larger instance set. 



{LOME) {RMSD) 



PDB 


M 


Isbuild 


buildup 


dgsol 


Isbuild 


buildup 


dgsol 


1PTQ 


402 


2.61 E-03 


1.80E+00 


5.41 E-01 


1.31 E-02 


9.49E+00 


6.89E+00 


1LFB 


641 


2.03E-04 


1.84E+00 


3.91 E-01 


4.19E-03 


1.23E+01 


5.48E+00 


1AX8 


1003 


2.00E-04 


1.83E+00 


4.33E-01 


1.62E-02 


1.35E+01 


7.95E+00 


1F39 


1534 


3.03E-02 


1.89E+00 


4.74E-01 


4.22E-01 


1.79E+01 


1.28E+01 


1RGS 


2015 


1.08E-01 


1.87E+00 


4.73E-01 


1.74E+00 


1.92E+01 


1.35E+01 


1KDH 


2846 


1.39E-02 


1.86E+00 


5.19E-01 


9.43E-02 


2.11E+01 


1.61 E+01 


1BPM 


3671 


2.20E-02 


1.90E+00 


5.14E-01 


7.86E-02 


2.29E+01 


1.55E+01 


1T0A 


4292 


6.90E-03 


1.89E+00 


6.75E-01 


2.56E-01 


2.52E+01 


2.39E+01 


1MQQ 


5681 


1.93E-02 


1.91 E+00 


8.86E-01 


1.89E-01 


2.50E+01 


2.50E+01 



Results with instances considering just 70% of the distances below 6A. The {LOME) and {RMSD) represent the mean LDME and mean RMSD respectively. 



means that the algorithm is robust and able to find pro- 
tein structures very similar to the original ones [1]. The 
results in Table 3 shows that the buildup algorithm was 
again the fastest. The CPU time of the algorithm Isbuild 
was in the average around to 2.45 times the time con- 
sumed by the algorithm buildup, this fact must be miti- 
gated by the better quality of the solutions obtained be 
the algorithm Isbuild. 



Table 3 TIME for the larger instance set. 


PDB 




{TIME) 




Isbuild 


buildup 


dgsol 


1PTQ 


9.99E-01 


5.34E-01 


1.03E+01 


1LFB 


1.86E+00 


1.01 E+00 


2.55E+01 


1AX8 


2.98E+00 


1 .70E+00 


4.36E+01 


1F39 


7.21 E+00 


3.57E+00 


8.59E+01 


1RGS 


1.43E+01 


4.70E+00 


1.33E+02 


1KDH 


2.12E+01 


7.28E+00 


2.09E+02 


1BPM 


2.47E+01 


8.04E+00 


2.99E+02 


1T0A 


3.93E+01 


1.14E+01 


7.03 E+02 


IMQQ 


3.93E+01 


1.82E+01 


7.63E+02 



The mean CPU time in seconds with the instances considering just 70% of the 
distances below 6A. 



Finally, the results of both set of instances indicate 
that our algorithm Isbuild based on the combination of 
the resolution of linear systems, derived from the 
approximated EDM matrices, and the refinement pro- 
cess based on hyperbolic smoothing penalty is a very 
effective strategy to solve MDGP instances with sparse 
and inaccurate data. 

Conclusions 

We presented a new algorithm to solve molecular dis- 
tance geometry problems with inaccurate distance data. 
These problems are related to molecular structure cal- 
culations using data provided by NMR experiments 
which, in fact, are not precise. Our algorithm combines 
the divide-and-conquer framework and a variation of 
the hyperbolic smoothing technique. The computational 
results show that the proposed algorithm is an effective 
strategy to handle uncertainty in the data. 

Additional material 



Additional file 1: Algorithm Isbuild. 
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